Appropriate for summarizing a set of numbers (continuous variables)
Choose a bin width and a center value, e.g. one-hour bins centered at the integers would be denoted \((5.5, 6.5]\), \((6.5, 7.5]\), \((7.5, 8.5]\), etc. Bins are non-overlapping. Create enough bins to cover the data completely
Assign each runner to a bin, e.g. 13.50 goes into the \((12.5, 13.5]\) bin and 13.51 goes into the \((13.5, 14.5]\) bin
Plot bars for each bin, with the height of the bar corresponding to the number of runners in that bin
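The binning steps above can be sketched in base R with `cut()`, which by default builds right-closed intervals like the ones shown. The times below are made up for illustration and are not taken from the ultrarunning data.

```r
# Hypothetical personal-best times in hours (illustrative values only)
times <- c(6.2, 9.8, 13.50, 13.51, 14.1)

# One-hour bins centered at the integers: edges at 5.5, 6.5, ..., 14.5
edges <- seq(5.5, 14.5, by = 1)

# right = TRUE (the default) makes each bin open on the left, closed on the right
bins <- cut(times, breaks = edges, right = TRUE)

# 13.50 lands in (12.5,13.5] and 13.51 in (13.5,14.5]
data.frame(time = times, bin = bins)

# Bar heights for the histogram are the per-bin counts
table(bins)
```

Because every value falls in exactly one interval, the counts in `table(bins)` add up to the number of runners.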
ggplot(ultrarunning) +
  geom_histogram(aes(x = pb100k_dec), binwidth = 1, center = 10, fill = "grey", color = "black") +
  labs(x = "Personal best time (hours)", y = "Count") +
  theme(text = element_text(size = 24))
Bin width of 10 hours – too large
Bin width of 3 minutes – too small
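To see why both extremes fail, it can help to count how many bins actually contain data at each width. This is a minimal sketch on simulated times (not the ultrarunning data); `count_nonempty_bins()` is a hypothetical helper written only for this example.

```r
set.seed(1)
# Simulated personal-best times in hours (illustrative only)
times <- rnorm(200, mean = 11, sd = 2)

# Hypothetical helper: bin x at a given width and count the bins holding data
count_nonempty_bins <- function(x, binwidth) {
  # Pad by one bin on each side so every value falls strictly inside the edges
  lo <- floor(min(x) / binwidth) * binwidth - binwidth
  hi <- ceiling(max(x) / binwidth) * binwidth + binwidth
  counts <- table(cut(x, breaks = seq(lo, hi, by = binwidth)))
  sum(counts > 0)
}

# 10 hours, 1 hour, and 3 minutes (0.05 hours)
widths <- c(10, 1, 0.05)
n_bins <- sapply(widths, count_nonempty_bins, x = times)
data.frame(binwidth_hours = widths, nonempty_bins = n_bins)
```

With very wide bins, a handful of bars hide the shape of the distribution; with very narrow bins, most bars hold only one or two runners and the plot mostly shows noise.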
Density plot
Alternative way to summarize a continuous variable
Smoothed version of histogram (but amount of smoothing is adjustable)
\(y\)-axis is density: single connected line and area under the line equals 1
Different amounts of smoothing
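A minimal base R sketch of the two points above, using simulated times rather than the ultrarunning data: the `adjust` argument of `density()` scales the default bandwidth, and the area under each curve stays close to 1 either way. `area_under()` is a hypothetical helper that applies the trapezoid rule.

```r
set.seed(1)
# Simulated personal-best times in hours (illustrative only)
times <- rnorm(200, mean = 11, sd = 2)

# Hypothetical helper: trapezoid-rule area under a density estimate
area_under <- function(d) {
  sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
}

# adjust < 1 gives a wigglier curve, adjust > 1 a smoother one
d_rough  <- density(times, adjust = 0.25)
d_smooth <- density(times, adjust = 2)

areas <- c(rough = area_under(d_rough), smooth = area_under(d_smooth))
areas  # both are approximately 1
```

`geom_density()` accepts an `adjust` argument with the same meaning, so the same comparison can be drawn in ggplot2.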
Comparison
Histogram
More familiar to most readers
\(y\)-axis is counts by default (but technically these should be densities*)
Requires choosing binwidth
Density plot
Less familiar to readers
\(y\)-axis is density
Algorithms exist to choose an appropriate amount of smoothing
Generally no reason not to show both.
*Density \(\neq\) probability (but you can think of it as relative probability)
ggplot(ultrarunning) +
  geom_histogram(aes(x = pb100k_dec, y = after_stat(density)), binwidth = 1, center = 10, fill = "grey", color = "black") +
  geom_density(aes(x = pb100k_dec)) +
  labs(x = "Personal best time (hours)", y = "Density") +
  theme(text = element_text(size = 24))
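As a small worked example of the footnote, the sketch below converts bin counts to densities by dividing by (number of observations × bin width). The times are made up so that one narrow bin is crowded; they are not from the ultrarunning data.

```r
# Hypothetical times in hours, chosen so one narrow bin is crowded
times <- c(9.81, 9.84, 9.87, 9.92, 9.95, 10.3, 10.7, 11.2)
binwidth <- 0.25
edges <- seq(9.75, 11.25, by = binwidth)

counts <- as.vector(table(cut(times, breaks = edges)))

# Density rescales counts so that the bar areas sum to 1
dens <- counts / (length(times) * binwidth)

data.frame(count = counts, density = dens)
sum(dens * binwidth)  # total bar area
```

The first bin has a density of 2.5, which cannot be a probability. The probability of landing in that bin is density × width = 2.5 × 0.25 = 0.625, i.e. 5 of the 8 runners.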
Mary E Spear
American graphic analyst for the US government for more than 30 years
Author of Charting Statistics (1952) and Practical Charting Techniques (1969)
ggplot(ultrarunning) +
  geom_boxplot(aes(x = pb100k_dec), outlier.shape = NA) +
  geom_jitter(aes(x = pb100k_dec, y = 0), width = 0, height = 0.33) +
  scale_x_continuous(name = "Personal best time (hours)") +
  scale_y_continuous(breaks = NULL, name = NULL, limits = c(-1, 1)) +
  theme(text = element_text(size = 24))
Random vertical jitter makes it possible to see multiple data points that share the same value
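A rough sketch of what the jitter does, in base R with made-up times: the x values stay untouched while each point receives a random vertical offset. The offsets here are drawn uniformly between -0.33 and 0.33 to mimic `height = 0.33`; this illustrates the idea and is not ggplot2's internal code.

```r
set.seed(1)
# Hypothetical times with repeated values that would overplot
times <- c(10.5, 10.5, 10.5, 12.0, 12.0, 14.25)

# No horizontal noise (width = 0); uniform vertical noise within +/- 0.33
y_jittered <- runif(length(times), min = -0.33, max = 0.33)

points_df <- data.frame(x = times, y = y_jittered)
points_df
```

The three runners at 10.5 hours now sit at three different heights, so all of them are visible, while their position on the time axis is unchanged.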
Violin plots
A more recent introduction is the ‘violin plot’: a density plot and its reflection
Easy to create and show for multiple variables at once, like boxplots, but gives a more accurate representation of the distribution, like a density plot
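The "density plot and its reflection" idea can be sketched by hand in base R: estimate the density, then trace it along one side and its mirror image back along the other. The data are simulated for illustration, and the column names `time` and `width` are arbitrary choices for this example.

```r
set.seed(1)
# Simulated personal-best times in hours (illustrative only)
times <- rnorm(200, mean = 11, sd = 2)

d <- density(times)

# Trace the density along one side, then its mirror image back along the other
outline <- data.frame(
  time  = c(d$x, rev(d$x)),
  width = c(d$y, -rev(d$y))
)

# Draw the closed shape when running interactively
if (interactive()) {
  plot(outline$width, outline$time, type = "n",
       xlab = "", ylab = "Personal best time (hours)", xaxt = "n")
  polygon(outline$width, outline$time, col = "grey70")
}
```

`geom_violin()` does this work automatically and repeats it for each group on the discrete axis.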
# See how pb_surface_name and sex_char are created on slide 31
ggplot(ultrarunning) +
  geom_bar(aes(x = pb_surface_name, fill = sex_char)) +
  labs(x = "Surface type", y = "Count", fill = "Sex") +
  theme(text = element_text(size = 24))

# See how pb_surface_name and sex_char are created on slide 31
ggplot(ultrarunning) +
  geom_bar(aes(x = pb_surface_name, fill = sex_char), position = "fill") +
  labs(x = "Surface type", y = "Proportion", fill = "Sex") +
  theme(text = element_text(size = 24))

# See how pb_surface_name and sex_char are created on slide 31
# preserve = "single" maintains a constant bar width, even when there are empty
# categories
ggplot(ultrarunning) +
  geom_bar(aes(x = pb_surface_name, fill = sex_char), position = position_dodge(preserve = "single")) +
  labs(x = "Surface type", y = "Count", fill = "Sex") +
  theme(text = element_text(size = 24))
Stacked vs. Filled vs. Dodged
All are different flavors of bar charts
Stacked
Emphasizes overall count differences between bars
Difficult to assess subgroup count differences between bars (except first subgroup)
Difficult to count small categories
Filled
Easy to assess overall proportional differences between bars for first and last subgroups
Difficult to assess subgroup proportional differences between bars in interior subgroups
No information on counts
Dodged
Emphasis on count differences between bars for each subgroup
Difficult to assess overall count differences between bars
Difficult to count small categories
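The trade-offs above come down to which numbers each chart draws. A minimal base R sketch with made-up categories (not the ultrarunning data): stacked and dodged bars draw the raw counts, while filled bars draw within-bar proportions.

```r
# Made-up categories for illustration (not the ultrarunning data)
runners <- data.frame(
  surface = c("Trail", "Trail", "Trail", "Road", "Road", "Track"),
  sex     = c("Female", "Male", "Male", "Female", "Male", "Male")
)

# Stacked and dodged bars both draw these counts
counts <- table(runners$surface, runners$sex)
counts

# Filled bars draw within-bar proportions: each row sums to 1
props <- prop.table(counts, margin = 1)
props
```

In `props`, the "Track" row is 100% male even though it rests on a single runner, which is the "no information on counts" weakness of the filled chart.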
Bars vs. Histograms vs. Columns
Bar charts / Bar plots
Only appropriate for categorical variables
Bars are categories. Nothing exists between bars
Bars may be ordered by value or by count
\(y\)-axis can be counts or proportions
Use geom_bar()
Histogram
Only appropriate for continuous variables
Bins are intervals based on binwidth. No gap between bins
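A side-by-side sketch of the two cases, assuming ggplot2 is installed and using a small made-up data frame (the column names `surface` and `time` are placeholders for this example, not the ultrarunning variables).

```r
library(ggplot2)

# Made-up data for illustration (not the ultrarunning data)
runners <- data.frame(
  surface = c("Trail", "Trail", "Road", "Road", "Road", "Track"),
  time    = c(9.8, 12.4, 10.1, 11.3, 13.7, 10.9)
)

# Categorical variable: one bar per category
p_bar <- ggplot(runners) +
  geom_bar(aes(x = surface))

# Continuous variable: adjacent bins of a chosen width
p_hist <- ggplot(runners) +
  geom_histogram(aes(x = time), binwidth = 1, center = 10)
```

Printing `p_bar` shows three separated bars, one per surface; printing `p_hist` shows touching bins along a continuous time axis.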
ggplot(ultrarunning, aes(x = pb_surface_name, y = pb100k_dec)) +
  geom_violin(fill = "grey70") +
  geom_boxplot(width = 0.25) +
  scale_y_continuous(name = "Personal best time (hours)") +
  scale_x_discrete(name = NULL) +
  theme(text = element_text(size = 24))
Layering in ggplot
Adding multiple geometric objects to one ggplot object results in multiple layered views of the information.
Example of jittered points over boxplot:
ggplot(ultrarunning) +
  # 1. First create boxplot
  geom_boxplot(aes(x = pb100k_dec), outlier.shape = NA) +
  # 2. Then add points across the y = 0 line with random vertical jitter
  geom_jitter(aes(x = pb100k_dec, y = 0), width = 0, height = 0.33) +
  scale_x_continuous(name = "Personal best time (hours)") +
  scale_y_continuous(breaks = NULL, name = NULL, limits = c(-1, 1)) +
  theme(text = element_text(size = 24))
Other examples of layering we’ve seen so far: layering density over histogram of running times (slide 15); data points over a boxplot (slide 24); boxplot over violin plot (slide 27)
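One way to keep layered code compact is to put the shared mapping in `ggplot()` so that every layer inherits it. The sketch below assumes ggplot2 is installed and uses simulated times instead of the ultrarunning data.

```r
library(ggplot2)

set.seed(1)
# Simulated times in hours (illustrative only)
runners <- data.frame(time = rnorm(100, mean = 11, sd = 2))

# Mapping x once in ggplot() lets every layer inherit it
p <- ggplot(runners, aes(x = time)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, center = 10,
                 fill = "grey", color = "black") +  # drawn first (bottom)
  geom_density()                                     # drawn second (on top)

# Layers are stored in the order they were added
length(p$layers)
```

Layers are drawn in the order they are added, so the density curve sits on top of the histogram bars here; swapping the two `geom_*` calls would put the bars over the curve.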
When to layer?
Some situations when you would want to layer:
When single layer has critical weakness / deficiency (Avoid distorting what the data say)
When you want to highlight both granular and aggregate components of the data (Reveal the data at several levels of detail, from broad overview to fine structure)
To anchor your data (layer A) within context of reference / other data (layer B) (Encourage comparison between different pieces of data)
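As a sketch of the third situation, the example below anchors a histogram (layer A) with a dashed reference line (layer B). It assumes ggplot2 is installed; the data are simulated, and using the sample median as the reference value is an arbitrary choice made for illustration.

```r
library(ggplot2)

set.seed(1)
# Simulated times in hours (illustrative only)
runners <- data.frame(time = rnorm(100, mean = 11, sd = 2))

# Hypothetical reference value: the sample median of the simulated times
reference_time <- median(runners$time)

p <- ggplot(runners, aes(x = time)) +
  # Layer A: the data
  geom_histogram(binwidth = 1, center = 10, fill = "grey", color = "black") +
  # Layer B: the reference the data are compared against
  geom_vline(xintercept = reference_time, linetype = "dashed") +
  labs(x = "Personal best time (hours)", y = "Count")
```

Any meaningful benchmark could take the place of the median, such as a qualifying time or a value from another dataset.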
Medium Spice: Make the layered violin plots on slide 27 or slide 28;
Yoga Flame: Make the layered Density+Histogram on slide 15 (hint: use after_stat to get the correct y-axis for the histogram); or the layered boxplot on slide 25 (hint: use geom_jitter instead of geom_point); or one of the bar charts on slides 35, 36, or 37 (hint: you will need to use case_when to create some character variables before making the plots);
Dim Mak: Make one of the bivariate violin plots on slides 40, 41, or 42;
References
Samtleben, E., 2023. Ultrarunning dataset. Teaching of Statistics in the Health Sciences Resource Portal. Available at https://www.causeweb.org/tshs/ultra-running/.